ByteLevelBPETokenizer output seems weird
#ByteLevelBPETokenizer
https://github.com/huggingface/tokenizers/issues/203
The merges.txt and vocab.json files I obtained are not human-readable.
Answer: https://github.com/huggingface/tokenizers/issues/203#issuecomment-605105611
The byte-level BPE converts all Unicode code points into multiple byte-level characters:
1. Each Unicode code point is decomposed into bytes (1 byte for ASCII characters, and up to 4 bytes for UTF-8 Unicode code points)
2. Each byte value gets a "visible" character assigned to it from the beginning of the Unicode table.
So some characters get other representations; for example, the (half-width) space U+0020 becomes Ġ.
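A minimal sketch of this two-step mapping in Python, reconstructing a GPT-2-style byte-to-visible-character table. This is an illustration of the scheme described above, not the tokenizers library's actual source code:

```python
def bytes_to_unicode():
    # Printable bytes keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = list(printable)
    code_points = list(printable)
    n = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (control characters, space, ...) are shifted
            # past the first 256 code points so they get a visible stand-in.
            byte_values.append(b)
            code_points.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


mapping = bytes_to_unicode()
print(mapping[0x20])                                         # Ġ (U+0120)
print("".join(mapping[b] for b in "hello world".encode()))   # helloĠworld
print("".join(mapping[b] for b in "日本語".encode("utf-8")))   # one visible char per UTF-8 byte
```

Running it shows that the space byte 0x20 maps to Ġ (U+0120), which is why spaces appear as Ġ in merges.txt and vocab.json.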